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Abstract 

Motivated by the problem of identifying correlations between 
genes or features of two related biological systems, we 
propose a model of feature selection in which only a subset 
of the predictors Xt are dependent on the multidimensional 
variate Y, and the remainder of the predictors constitute 
a "noise set" Xu independent of Y. Using Monte Carlo 
simulations, we investigated the relative performance of two 
methods: thresholding and singular-value decomposition, 
in combination with stochastic optimization to determine 
"empirical bounds" on the small-sample accuracy of an 
asymptotic approximation. We demonstrate utility of the 
thresholding and SVD feature selection methods to with 
respect to a recent infant intestinal gene expression and 
metagenomics dataset. 

1 Introduction. 

1.1 Motivation. Our study is motivated by the chal- 
lenge of performing an integrative analysis of a recent 
infant intestinal host-metabiome dataset [TS] . The data 
consists of microarray intensities for p = 585 genes, X, 
and next-gen sequencing hits for microbial DNA frag- 
ments organized into q = 211 subsystem classes, col- 
lected from stool samples of n = 6 newborn babies. 
Standard tests reveal conclusive evidence that the gene 
expression data and microbiome attributes are depen- 
dent [H]. The next objective is to qualify the detailed 
nature of this association; however, the high dimension- 
ality of the data poses a computational difficulty for 
modelling. In order to reduce the dimensionality of the 
data for initial exploratory modelling, it is necessary to 
employ feature selection to select a smaller subset of the 
genes. 

1.2 Background. Feature selection in the context of 
a univariate response has been extensively studied in the 
statistics and data mining literature However, much 
less has been done on feature selection for a multivariate 
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response vector. Group lasso [8^ has been studied as a 
feature selection method for multivariate linear regres- 
sion, but has been generally used for multi-task learn- 
ing. Sparse canonical correlation analysis P^|21|P^ 
has been proposed specially for high- throughput biolog- 
ical data. However, sparse CCA does not directly pro- 
duce a ranking of the features, but rather returns a list 
of genes of varying cardinality depending on tuning pa- 
rameters. Meanwhile, a factor-analysis-based model [T3] 
has been introduced as a bayesian version of canonical 
correlation analysis; however, the dimensionality of our 
data makes bayesian computation impractical. There- 
fore, in this paper, we study a siniplifcd version of sparse 
CCA which produces a ranking of the features, which 
we call the SVD method. 

1.3 Objectives The objectives of this current work 
are to develop tools for investigating of the perfor- 
mance of two feature selection methods (thresholding 
and SVD), and then to apply these tools to inform a 
integrative In section f|2]we propose a model for evalu- 
ating the performance of the feature selection methods, 
develop asymptotic tools for deriving analytical results, 
and investigate the effectiveness of the asymptotic ap- 
proximations using simulation. In section ^ we apply 
the thresholding and SVD methods to two sets of in- 
tegrated microarray-metagenomics data, and use sim- 
ulation results based on our proposed model to obtain 
required sampl size estimates for follow-up experiments. 
Further applications of the present body of work are dis- 
cussed in S|4] 

2 Methods and Technical Solutions. 

2.1 Feature Selection Model In our application, 
we hypothesized that associations between the host 
genes and bacteria gene expression levels are generally 
negligible, except for a small fraction of host genes and 
microbial gene categories with significant interaction. 
Therefore, in our model, we assume that the host genes 
with expression levels correlated with the expression 
levels of the microbial attributes form a small subset 
Xt of the host genes X, and that the rest of the host 
genes A„ are independent of the microbial attributes Y . 
It then follows that, letting X — (Aj, A„) without loss 



of generality, and also putting Y 
independent of X, we have 



(2.1) 
where 
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{Yt,Yu) where Yu is 



^XtVt = Cov {Xt,Yt), and where Xt = 
, . . . , /pj (Xp^ ) } , where pt is the dimension of Xt . 
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Further assuming that Cov{X) = Ip, Cov{Y) 
it follows that the covariance matrix of (X, Y) is 



(2.2) 



Now we consider feature selection algorithms which 
return a ranking ip : {1, . . . ,p} — >■ {1, . . . ,p} of the fea- 
tures in X. Here the ranking i/j is formally represented 
by a bijective map which associates to each ordinal rank 
1, . . . , p an index 1, . . . ,p of X (thus, we ignore the pos- 
sibility of ties.) Thus the feature X^(^i^ is interpreted 
as the "most promising feature." To formally evaluate 
ranking methods we use the 1-0 loss for the top-ranked 
feature: 



(2.3) 



L(V)=/(^(l)>ft), 



recalling that we arrange X as (Xt,Xu) so that 
Xi, . . . , Xp^ are correlated with Y. The rankings ip can 
be obtained from real- valued scores s:{l,...,p}— >M 
by letting 

(2.4) r'(») = E/(s(j) > + f]/(.(j) = si^)) 

i.e., ranking by scores and breaking ties in favor of the 
lowest index. 

2.2 Feature selection methods Perhaps the most 
straightforward ranking method is based on threshold- 
ing the elements of the covariance or correlation ma- 
trix SxY = Cov(X, Y) or Cor(X, Y): i.e.,defining the 
score as 



(2.5) 
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where SxiY is the ith row of Sxy- 

We also consider a ranking method which uses the 
singular-value decomposition of the cross-correlation or 
covariance matrix. Recall that the singular-value de- 
composition Sab = U DV^ is the unique matrix de- 
composition in which U and V are semiorthogonal, and 
D is diagonal with nonnegative entries in descending 



order. The first left singular vector mi is the first col- 
umn of [/, and we define the score based on the absolute 
values of the components of Ui : 



(2.6) 



ssvoii^ = \uu\. 



It is known that the left singular vector ui satisfies the 
criterion 

ui — argmax,jCou(Xu, Yw) subject to ||u||2 = 1, H^^lb = 1- 

In comparison, the classical technique of canonical 
correlation analysis |1] finds u, v which maximize 
Cor{'X.u,'Yv). However, the fact that canonical cor- 
relation analysis depends on inverting the inter-class 
sample covariance matrices Sx , Sy limits its applica- 
bility to data with small sample sizes. Meanwhile, 
the sparse canonical correlation analysis algorithm pro- 
posed by Witten[2T] proceeds by substituting Ip and Iq 
for Sx , Sy , but this can be easily seen to lead to an 
equivalent criterion to (2.7 1. However, Witten's algo- 
rithm allows for automatic inference of the number of 
significant features through the use of an additional £i 



penalty to (2.7), in contrast to our framework, in which 
we simply rank the features and leave to the user the 
decision of how many features to keep. For instance, in 
|3.4| we demonstrate the use of permutation-null derived 
false discovery rates for determining how many genes to 
report. 

At n = 2, due to the fact that the sample covariance 
matrix is rank 1, both the thresholding and SVD 
methods necesarily produce the same ranking. However, 
for n > 2, the rankings can differ. 

2.3 Asymptotics As our ultimate goal is to obtain a 
general understanding of optimal feature selection under 
our model, analytical results for the performance of all 
feature selection methods are indispensable. Analagous 
results have been obtained for sparse PCA [23] using 
asymptotics for n — oo and also for the joint limit 
n — > cxD,p — )> cxD. For the multivariate feature selection 
problem, a variety of asymptotic limits can be consid- 
ered: by increasing the sample size to infinity while also 
changing the number of correlated features, number of 
extraneous features, number of correlated or extraneous 
variates, or a number of combination of these. However, 
we find it most convenient to consider a limit in which 
the matrix Exy is shrunk to zero as the sample size 
increases. 

While we expect that the sample correlation matrix 
will be used more often than the sample covariance 
matrix in applications, the intractable distribution 
of the sample correlation matrix (TS] leads us to 
consider only the case in which S is sample covariance 



matrix. Then under our model, it is possible to obtain 
asymptotic independence of the entries of scaled sample 
cross-covariance matrix ^/nSxY by letting n — > oo 
while simultaneously allowing the covariance matrix S 
to change depending on the sample size. This result is 
stated below. 



(2.8) 
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and let SxYin) be the submatrix formed by the first 
p rows and the last q columns of S{n). Then as 
n — > oo, vec(y'TiSxYi'n)) converges in distribution to 
N{vec{y/nQQ),Ip ® Iq). 

Proof. Let \/nS{n) = (sij), T,{n) — (aij), then note 
that Cov{sij,Ski) = (Jikf^ji + cruajk (see [IT], p. 90). 
Recall that y/nSxy consists of the elements Sij where 
i < p and j > p. Thus Cov{sij , Ski) can be calculated 
by: 



Case 1. i 



k,j 
Tt 

and j > p, lim. 
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1 , while 



Cii = cTjk- Thus, Var{sij) = 1 + cr^;. But for i < p 
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(2.10) 



lim Var{sij) — ?► 1. 



Case 2. i ^ k or j ^ I. li i k, then aik = 0, since it 
lies off the diagonal of Ex = Ip- Similarly, ii j ^ I, 
then (jji — since it lies off the diagonal of Ey = Iq . 
Thus, Cov{sij, Ski) vanishes asymptotically. 

The result then follows from applying the multivariate 
central limit theorem. 

One can easily see that the matrix S(n) as defined 
above is positive semidefinite for n > uq ([4j). Note 
that while ^/nSxvin) converges to a distribution, the 
full matrix ^/nS{n) fails to converge in distribution since 
its diagonal elements tend to infinity. 

Now note that thresholding and SVD methods have 
the following expressions for the 1-0 loss when applied 
to matrix T = Sxy'- 



(2.11) 



Lthres{T) = /(argmaxjll(r^)illoo < Pt) 



and 
(2.12) 



Therefore we can compute the asymptotic approx- 
imation for the risk of the thresholding method as 
follows: 



Proposition 2.1. Let uq, fl, Sxy be defined as in 



Theorem 2.1 and let Lthres be defined as in (??). Then, 



Theorem 2.1. Let Q be a p x q real matrix. Define (2-13) 



lim E{Lthrcs{SxY)) 

n— >-oo 

(2.14) 



= / {p^q)F^2ixr-''-^f^Jl-l[l[F^.^^.ix))dx 

V ^=lj = l ^ 

where F^2^^{x) is the cdf of the noncentral chi-squared 
distribution with 1 degree of freedom and noncentrality 
parameter d > 0, F^2(x) — F^2q(x) and f^2 = 

-S^xM(^)- 

Proof. Observe that when Z ^ N{fi, I^n) 

m 

(2.15) Pr[\\Z\\^<x]^l[Pr[Zf <x% 

meaning that 
(2-16) 

'^iPthresiSxY)) = P'rW^ec{S XtY)\\oo < \\vec{Sx^Y Woo] 

can be calculated in terms of noncentral chi-squared 
distributions. 

In a similar way, bounds on 'E{LsvDiSxY) can be 
obtained by comparing the singular values of SxtY and 
Sx^Y- We also claim without proof that that such 
asymptotic approximations uniformly converge to the 
true risk function for fixed p, q, and < pt < p, as 
no tends to infinity, for any feature selection methods 
which follow the two conditions: 

• Monotonicity with respect to sample- size: 
(2.17) nL{SxY{n))) < E{L{SxY{n + 1))) 

• Monotonicity with respect to signal strength: For 



all n £ 



^pxq 



with \\n\\ < 1, defining E(A) 



Lsvd{T) /(argmaxj(argmax„||u T||)i|) < pt 



\riT '^T^] fc)^ positive constant A < 1, S{n, A) ^ 

A* * Iq J 

Wishart{n, ^T,{X)), and SxY{n,X) as the subma- 
trix comrpised of the first p rows and q columns of 
5(A), 

(2.18) E{L{SxY{n,m>E{L{SxY{n,fi))) 



Additionally we claim that the thresholding method 
and the SVD method both satisfy these monotonicity 
conditions. We postpone the technical justification of 
these claims for a forthcoming theoretical paper. 

To determine the small-sample validity of the 
asymptotic approximation obtained above, we use 
stochastic optimization applied to Monte Carlo simu- 
lations, as we discuss in the subsequent subsection. 

2.4 Computational Methods For simulation pur- 
poses we assume that X, Y have a multivariate joint nor- 



3. Form sample cross-covariance matrices S'^y by 
extracting the first p rows and last q columns of 
a Wishart(n — 1, ^^E^'^) matrix. 



mal distribution with the covariance matrix (2.2). To 
reduce the size of the parameter space, we set pt = qt 
and require that HxtYt be a random matrix parameter- 
ized by a single parameter, pt- Specifically, we let 



(2.19) 



^XtYt — G1DG2 



where Gi, G2 are independent random xpt orthogonal 
matrices and be a diagonal matrix with diagonal 
entries di , . . . , dp^ . 

The resulting model consists of four parameters: 

• n, the sample size 

• Pi, the number of correlated features and response 
variates 

• Pu, the number of extraneous features which are 
uncorrelated with the response 

• g„, the number of components in the response 
vector uncorrelated with the explanatory variate 

Under this model define the following functions of 
the parameters n,pt,Pu,qu- 



(2.20) 
(2.21) 
(2.22) 
(2.23) 
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Lsvd{Sxy)], 


E[l - 


Lsvd{Sxy)], 



i.e. P is the probability that the top-ranked feature is 
correlated to Y. 

Using monte carlo simulations we can approximate 
Pthres: PsvD, Pthres.PsvD by the foUowiug procedure 

1. For monte carlo trials i — l,...,mcres gener- 
ate independent random orthogonal matrices [17^ 
O'f^ , 02'^ and independent random diagonal matri- 
ces with uniform [0,1] entries D''^. 

2. Form population cross-covariance matrices by 



(i)T 



and population covariance 



matrices E^*' 



ip ^jcy 
T 

^XY ^9 y 



4. Form asymptotic sample cross-covariance matrices 
S^V by 



(2.24) vec{S'j^>y) ^ N{vec{Y.^^y), ^^Ip ® Ig)- 



5. For each '^'Vy^Vy ^PPY thresholding 
and SVD methods to obtain rankings 

7//*' ?//*^ 7//'^ 7//') 

V thresh VSVD^ ^thres^ VSVD 

6. Compute approximate values of 

Pthres, PSVD, Pthres, PsVD by 



(2.25) Pth res — 

(2.26) PsvD = 

(2.27) Pthres = 

(2.28) PsvD - 
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In this paper we use stochastic search techniques to 
calculate approximate bounds on 



(2.29) 
(2.30) 



max \Pthres — Pthr 



max \Psvd-Psvd\ 
Pt,Pu,qxi 



for fixed n. 

The problem of optimizing Pthres /svd- Pthres /svd 
over a three-dimensional discrete parameter 
space pt,Pu,qu can be handled via two different 
approaches [IB]: 

• Optimizing over a fixed grid of points {pt,Pu,qu) 
using sequential testing methods. 

• Using a stochastic analogue of gradient descent, 
stochastic approximation 

However, in order to take advantage of our massively 
parallel computing setup, we develop a population- 
based optimization technique which combines aspects 
of both approaches. The proposed algorithm is outlined 
below: 

Algorithm 1 

1. Given a random variable X\6 over a parameter 
space 9, we wish to find 

argmaxgE[JiL 16*] 



2. Starting with a grid of parameter values 9i, . . . , 9 kg , 
compute empirical means X\Oi, . . . ^ X\0kg using 
mCres repeated measurements at each parameter 
value 

3. At the tth step, let 9[, . . . , 6*,^^ be the m parameter 
values with the largest empirical means X among 
di, ■ ■ ■ , 9kt-i 

4. Update ^|6'* , . . . , X\9l^ with mCres additional mea- 
surements of X at each parameter value 

5. Generate 6'fct_i+i, • • • , fi'fcj by randomly perturb- 
ing 9l,...,9l^, and compute empirical means 
X\9kt_i+i, ■ ■ ■ , X\9kt using mCres repeated mea- 
surements at each parameter value. 



6. Repeat until step t final- 

Our optimization results are discussed in §2.5| 



2.5 Computational Results Table 1 provides the 
results obtained for n = 2 and n = 6. Note that 
standard errors for all probabilities are less than 0.002. 
In both cases we run Algorithm 1 for t final = 5 
steps, using mCres — 5000, fco — 500, with 9i,...,9ko 
being grid points over pt — {2, 3, 4, 5, 6},p„ = 
{l,6,...,41,46},g„ = {1,6, ...,41,46}, m = 10, and 
kt — kt-i = 100, with 6*^:^ ^-1.1, ■ ■ ■ ,0kt being generated 
by creating 10 perturbed copies of 9l~^ , . . . , 9{q^ with 
additive perturbations (61,62,63) where i5i is uniformly 
distributed over {—1, 0, 1} and ^2, 63 independently and 
uniformly distributed over {—3, —2, . . . , 2, 3}. 

Note from Table 1 that the maximum discrepancy 
between the asymptotic result and the true small- 
sample value decreases from n = 2 to n = 6 as 
might be expected. However, these results are far from 
exhaustive, and it remains to perform the optimization 
for larger values of n to confirm the apparent small- 
sample accuracy of the asymptotic approximation. 

3 Application. 

3.1 Summary In this section we apply the thresh- 
olding method and SVD method to select genes from 
a recent microarray-metagenomics dataset (^3.2). For 



each method we obtain a global permutation null dis- 
tribution to determine false discovery rates for the cor- 
responding ranked list of genes (^3.4). 



We use these q- values as a basis to determine which 
of the resulting rankings to use and to select how many 
genes to report from that ranked lists (j ]3.5[ ). The 
strongest results are obtained from applying the SVD 
method to the formula-fed data, which accords with 
our simulation results indicating the relative strength of 
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Table 1: Stochastic optimization results for n = 2 and 
n = 6 



SVD for low sample sizes and with previous observations 
of the relative homogeneity of the formula-fed data. 
Based on a q- value cutoff of 0.15 we end up reporting ten 
genes: MMD, PPP3CA, AL0X5, PAFAH2, C1QTNF6, 
MSRB3, VTN, ACVRIB, WASL, and MET. To investi- 
gate the validity of the resulting q- values, we check our 
results against rankings of genes from the thresholding 
and SVD methods combined with an alternative permu- 
tation null. We observe that although higher q-values 
result from the local null, the rankings of genes result- 
ing from SVD applied to the formula-fed data with the 
global null and the local nulls have high overlap. In 
particular, PPP3CA and AL0X5 are top-ranked genes 
in both procedures. We then apply the SVD proce- 
dure to identify metabiome attributes associated with 
the ten selected genes, but none of the metabiome at- 
tributes are found to be siginificantly associated with 
the selected genes. 

We discuss possible biological interpretations of 



these findings in ^3.6 



For the purpose of determining the sample size 
needed for a follow-up study, in §3.7| we find the 
simulated performance of the thresholding and SVD 
method as the sample and pt, the true number of 
correlated genes and metabiome features, are varied. 
From these results it is clear that while the SVD 
method dominates the thresholding methods at low 
sample sizes, the thresholding method rapidly improves 
in performance as sample size increases and as pt, 
the number of correlated genes, increases. Yet even 
under the most favorable conditions it appears that a 



sample size of around 100 is required for reliable feature 
selection under our model, for p = 600 and q — 200. 

3.2 Dataset The data originates from an experiment 
to study the effect of breast-feeding versus formula- 
feeding on infant health. Stool samples were collected 
from six breast-fed babies and six formula-fed babies, 
and gene expression levels were obtained via microarray 
intensities of host mRNA fragments isolated from the 
stool sample, while bacterial microbiome subsystem 
profiles were obtained by aggregating the fragments 
detected by metagenomic pyrosequence according to the 
three- level MG-RAST annotation [2]. 

Previous analyses characterised differences between 
the gene expression levels of the two treatment groups 
[S] and multivariate relationships between the host 
expression levels and microbiome attributes which were 
potentially induced by the differences between treament 
groups [TS]. The current study is motivated by the 
goal of identifying mutalistic relationships between the 
host and the intestinal microbiome on the basis of 
the microarray-metagenomics expression data for each 
treatment group seperately. 

3.3 Preprocessing. As per the suggestions in |15) , 
we focus on the immunology-related genes, producing a 
data matrix of 6 observations by 585 genes, ^raw We 
select the microbial attributes with read counts higher 
than 300, resulting in a data matrix of 6 observations 
by 211 microbial feature hit counts for each treatment 
group, Yraw We apply loess normalization to the 
log-transforms of the raw intensities in Xj-am |15j . 
standardize the rows and columns of 'Kj-aw and Yraw 
to have mean and variance 1 as described in [7 to 
arrive at the processed matrices X. The hit counts in 
Yraw are converted to proportions by individuals, then 
log-transformed, then row and column standardized to 
produce Y. 

3.4 Procedure We form Sxy — Cov(X,Y) and 

apply the thresholding and SVD methods to rank the 
genes in X. 

We also obtain false discovery rates (g-values) for 
each method by using a global row-wise permutation 
null ditribution and prior false positive rate ttq = 1 [7]. 

For each method we compute a separate p-value 
Pthres 1 Psv D for cach gene via a global permutation null 
distribution for the scores Sthres , ssvd of the individual 
genes by the following: 

1. Let Sthresii) be the score of the jth gene according 
to thresholding, as from (2.5), and ssvdH) be the 
score of the jth gene according to SVD, as from 

(2:6|. 



2. For repetitions i — 1, . . . with = 1000, 
form permuted data matrix Y^*' by independently 
permuting each row of Y. Then form cross- 
correlation matrices S'3s:y from Cor(X, Y). 

3. Compute scores sl^^^es ^^'^ ^^svd from S'^yi). 

4. Compute the p-values of the jth gene according to 
thresholding and SVD as: 

(3.31) 

"V^' V- ^i^thresU) < sites (fc)) 
Pthres (3) ^ 2^ 2^ 



= 1 k=l 



'I^CresP 



(3.32) 

Psvnij)^j:±'^''''^'^-''^-^^ 

i=l k=l 



^^CresP 



Next, let Tthres{j) be the ranking of the jth gene in 
ascending order of the pt/ires-values, and let Tsvoij) 
be the ranking of the jth gene in ascending order 
pf the p5yD-values, with ties broken in favor of the 
lowest index. Note that t = -0^^ when a global null 
distribution is used. Compute the false discovery rates 
qthresii) and qsvdii) as 



(3.33) 



Qthres 



(3.34) qsvoU) = mp{psvDij))/TsvD{j) 

where m is a correction factor for dependence [Hj, 



(3.35) 



m 



J 



which evaluates to 6.95 for p — 585. 

For comparative purposes we compute alternate p- 
values Pthres, PsvD according to a local permutation 
null distribution. Note that resulting ascending rank- 
ing of Pthres, PSVD may differ from ^pthres,^svD respec- 
tively, since each gene has a unique null distribution. 
The procedure is as follows: 

1. For repetitions i — 1, . . . with = 1000, 

form permuted data matrix Y^*' by permuting 
the row labels of Y. Then form cross-correlation 
matrices S'^^y fr'om Cor(X,Y). 



2. Compute scores sj^g^ and Sgy^, from S'^yi). 

3. Compute the p- values of the jth gene according to 
thresholding and SVD as: 



{i) 



(3.36) PthresU) = X 



I{sthres{i) < SthresU)) 



(3.37) PsvoiJ) 



rncres 

E 

1=1 



thres 



Tabic 2: Breast-fed data: Global null results 



From these p-values p we obtain alternate rankings 
f^^ for thresholding and SVD. We discuss the rank ings 



ipthres and tpsvo, 



^SVD 



for the formula-fed data in S 3.5 



3.5 Results Tabic 2 provides the top three genes 
identified by thresholding and SVD applied to the 
breast-fed data along with q-values obtained from the 



global permutation null {{ 3.4 1, and Table 3 provides the 
analagous results for the formula-fed data. Note that q- 
values for the SVD method can exceed 1 due to the 
correction factor for dependence. 

Note that while thresholding has comparable q- 
values for the breast-fed and formula-fed data, the 
SVD method produces extremely weak q-values for the 
breast-fed data but extremely strong q-values for the 
formula-fed data. This discrepancy in performance 
may be due to the increased variability in the gene 
expression levels for the breast-fed data, as observed 
in [6] through examination of the raw intensities of 
"housekeeping genes" for the formula-fed and breast- 
fed data. Furthermore, it is already clear from Tables 2 
and 3 that SVD applied to the formula-fed data has the 
strongest results overall. Table 4 provides the entire list 
of genes produced by the SVD method applied to the 
formula-fed data with a q-value less than 0.15. It is also 
worth noting that PPP3CA and PAFAH2 are common 
to both the top 10 genes for the thresholding and SVD 
method; what is not shown is that there are no other 
commonalities to the top 10 genes list. 

Table 4 provides the alternate p-values psvD com- 
puted for the SVD method applied to the formula-fed 
data using the local permutation null described in §3.4| 
The rankings i^svD and Tgyu have high overlap in the 
sense that 7 of the top 10 genes in tp are also among 
the top 10 genes in Tgy^: namely: MMD, PPP3CA, 
AL0X5, PAFAH2, C1QTNF6, VTN, and ACVRIB. In 
particular, PPP3CA nad AL0X5 are in the top 3 genes 
in both permutation nulls. 

From these results we judge it appropriate to select 
the top ten genes resulting from SVD applied to the 
formula-fed data for further analysis. 

In order to identify the metabiome attributes most 
closely associated with these ten genes, we let X be the 
metabiome data and Y be the intensities for the ten 
selected genes, and apply SVD-based feature selection. 
The results are listed in Table 5. The first column of 
Table 5 provides the name of first SEED hierachy of the 
microbial attribute, which is the broadest categorization 
in the MG-RAST SEED annotation scheme. The sec- 
ond column is the name of the MG-RAST subsystem 



# 


name 


Qthres 


name 


QSVD 


1 


THBS2 


0.38 


GBPl 


3.78 


2 


FYN 


0.28 


TNFAIP8L1 


2.27 


3 


CRNN 


0.39 


TYROBP 


1.86 



# 


name 


Qthres 


name 


qsvD 


1 


PPARA 


0.15 


MMDl 


0.00 


2 


PPP3CA 


0.86 


PPP3CA 


0.00 


3 


SDC4 


0.39 


AL0X5 


0.00 


4 


PAFAH2 


0.70 


PAFAH2 


0.00 



annotation, the finest level of the hierarchical SEED 
annotation scheme and the level chosen for data aggre- 
gation. While the q-values are very weak, it is worth 
noting that two of the top five attributes belong to the 
virulence category, since only 11 of the 211 microbial 
attributes belong to the virulence category. While two 
of the top five attributes also belong to the carbohy- 
drates category, this is less interesting since a total of 
42 out of 211 of the microbial attributes belong to the 
carbohydrates category. 



3.6 Discussion The results of our analysis suggest 
that the gene PPP3CA merits further investigation. 
While we could not conclusively determine which of the 
metabiome attributes were associated with PPP3CA, 
we have relatively high confidence that PPP3CA is cor- 
related with the metabiome attributes since the gene is 
highly ranked by multiple methods. The gene PPP3CA 
codes for the enzyme calcineurin, which generates a sig- 
nal activating the gut immune system 19J. One of cal- 



Table 4: Fomula-fed data: SVD results 



name 


T 


q 


f 


P 


MMD 


1 


0.00 


5 


0.002 


PPP3CA 


2 


0.00 


1 


0.000 


AL0X5 


3 


0.00 


2 


0.000 


PAFAH2 


4 


0.00 


6 


0.004 


C1QTNF6 


5 


0.00 


10 


0.011 


MSRB3 


6 


0.00 


11 


0.011 


VTN 


7 


0.00 


3 


0.002 


ACVRIB 


8 


0.00 


4 


0.002 


WASL 


9 


0.08 


27 


0.040 


MET 


10 


0.11 


14 


0.013 



Table 5: Metabiome attributes associated with selected 
genes 



SEED 1 


name 


TSVD 


qsvD 


Carb. 


Se.-giyox. cycle 


1 


0.31 


Phos. 


Control. PHO 


2 


0.31 


Viru. 


CoZnCd res. 


3 


0.28 


Carb. 


Beta-Gl. met. 


4 


0.43 


Viru. 


Res. fluoroq. 


5 


0.47 



cineurin's specific functions is to dephosphorylate NEAT 
transcription factors to promote immune activation [14j . 

The genes AL0X5 and PAEAH2 were also selected 
by more than one feature selection method. In addition, 
AL0X5 was also selected in a previous study on the 
combined formula-fed and breast-fed data [15]. The 
gene AL0X5 codes for arachidonate 5-lipoxygenase, 
which is involved in mucosal inflammatory responses [5] . 

While the results of the metabiome attribute selec- 
tion were much weaker than the results of the feature 
selection for the genes, it is intriguing that two of the 
top five metabiome attributes were virulence-related: 
namely, cobalt-zinc-cadmium resistance and resistance 
to fluoroquinolones. Correlations between the immu- 
nity and defense-related host genes and the virulence 
attributes would agree with the biological intuition that 
the host would react to pathogens in the instestine; or 
that conversely, that pathogenic activity may increase 
as a result of inhibited host immunodeficiency. 

3.7 Simulation results In Eigure 1 we show simu- 
lated results for Pthres and Psvd for p = 600,(7 = 200 
and pt varying from 10 to 100, n varying from 2 to 100. 
The height of the dark grey bars is the Pthres and the 
height of the light grey bars is Psvd from to 1. The 
axis with the rising slope is the axis for pt, taking values 
(100, 90, ... , 10) from left to right. The axis with falling 
slope is for n taking values from (10, 20, ... , 100) from 
left to right. We used = 40000 monte carlo tri- 

als for each parameter value; thus the standard errors 
< 0.025 result in confidence bounds which are too small 
to be visible. 

From the simulation we conclude that for plausible 
values of pt , the top ranked gene via thresholding (or 
SVD) is a false positive with probability exceeding 0.9. 
However, SVD is indeed more effective than threshold- 
ing at n — 12. But as we can see, as the sample size 
increases, the thresholding method rapidly climbs in rel- 
ative effectiveness. At n = 70, the thresholding method 
has a higher probability of assigning the top ranking to 
a correlated gene than the SVD method for pt < 50. 
At n = 100, the thresholding method outperforms the 
SVD method for pt < 90, which encompasses most of 



Eigure 1: Simulated performance of thresholding (dk. 
grey) versus SVD (h. grey) for p = 600, q = 200 



100 




the biologically plausible range for pt. Of note is the 
nonmonotonicity of the thresholding method with re- 
spect to Pt for fixed Ti,p, q: while both SVD and thresh- 
olding increase in effectiveness for increasing pt when 
Pt is large, thresholding experiences a dramatic increase 
in effectiveness for decreasing pt when pt is small. Yet 
even under the best plausible conditions, with pt — 10 
for thresholding, a minimum sample size of 100 is re- 
quired for the top-ranked feature to be correlated to Y 
even 80 percent of the time. 

These significant discepancies in performance, how- 
ever, would seem to indicate that neither the thresh- 
olding method nor the SVD method can be claimed to 
be the "optimal" method, and that there may exist an 
as-of-a yet undiscovered method which dominates both 
of these simple approaches. 

4 Impact and Significance 

Our simulation results succeed in providing a basic un- 
derstanding of the differences between the threshold- 
ing and SVD methods. To our knowledge, such a com- 
parative study of multivariate feature selection methods 
has never appeared in the literature. In addition, our 
model allows for the quantitative analysis of experimen- 
tal design considerations. Researchers desiring an un- 
derstanding of an integrated biological system can use 
the model proposed in the paper to determine the rel- 
ative value of additional observations versus measure- 
ments of additional biological features (depth versus 
breadth). This approach provides an appreciation for 
the importance of having prior knowledge that can al- 
low for elimination of extraneous features or variates. 

With respect to the original problem which mo- 
tivated this work, our data analysis diagnostics and 
simulation results demonstrate that singular value de- 
composition is an effective tool for identifying correla- 



tions between genes and microbial attributes for small- 
sample microarray-metagenomics datasets. Our data 
analysis of the infant microarray-metagenomics dataset 
indicate that the combination of SVD-based feature 
selection with permutation-null-derived false discovery 
rates provides a powerful framework for inferring host- 
microbiome interactions. 

While we only scratch the surface of the multivari- 
ate feature selection problem in this paper, by the same 
token, the tools we introduce can be employed in fur- 
ther studies on multivariate selection. The asymptotic 
approximation for the sample cross-covariance matrix in 
our model can be used for any feature selection method 
to be studied using our model. We demonstrate how 
stochastic optimization can be used to evaluate the ac- 
curacy of the asymptotic approximation. In addition, 
the same stochastic optimization techniques can be used 
to compare the performances of two competing feature 
selection methods. 

It would be interesting to compare the performance 
of group lasso, sparse CCA and bayesian approaches 
to feature selection under our proposed model. In 
particular, we expect our results on the SVD method 
to generalize to the performance of sparse CCA feature 
selection methods due to the similarity between the 



algorithms (^2.2). Based on our simulation results, we 
predict that the thresholding method also outperforms 
the sparse CCA method as the sample size increases. 
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